Abstract

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch
problem we must compute the Hamming distance between a pattern of length m and every m-length 
substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming 
distance is greater than k at some alignment of the pattern and text, we simply output “No”. 

We study this problem in both the standard offline setting and also as a streaming problem. In the 
streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before
processing any future symbols. Our main results are as follows: 

• Our first result is a deterministic O(nk^2 log k/m + n polylog m) time offline algorithm for
k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result
of this form from SODA 2000 [9, 10].

• We then give a randomised and online algorithm which runs in the same time complexity but
requires only O(k^2 polylog m) space in total.

• Next we give a randomised (1 + ε)-approximation algorithm for the streaming k-mismatch problem
which uses O(k^2 polylog m/ε^2) space and runs in O(polylog m/ε^2) worst-case time per arriving
symbol.

• Finally we combine our new results to derive a randomised O(k^2 polylog m) space algorithm for
the streaming k-mismatch problem which runs in O(√k log k + polylog m) worst-case time per
arriving symbol. This improves the best previous space complexity for streaming k-mismatch from
FOCS 2009 [26] by a factor of k. We also improve the time complexity of this previous result by
an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).


1 Introduction 

We study the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem
we are given as input two strings, a pattern of length m and a text of length n. The task is to output
the Hamming distance between the pattern and every m-length substring of the text where the Hamming
distance is at most k. If the Hamming distance is greater than k we need only output “No”. We provide new,
faster and more space efficient solutions for the k-mismatch problem in both the classic offline setting and
when considered as an online streaming problem.
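
As a point of reference for the problem definition above, the following sketch computes the required output by direct comparison in O(nm) time. It is not one of the algorithms of this paper. The function name `k_mismatch_naive`, the 0-based indexing and the use of the string "No" as the sentinel value are choices made here purely for illustration.

```python
def k_mismatch_naive(pattern, text, k):
    """Reference (quadratic time) solution of the k-mismatch problem.

    For every alignment i, i.e. for the window text[i-m+1 .. i], report the
    Hamming distance to the pattern if it is at most k, and "No" otherwise.
    """
    m, n = len(pattern), len(text)
    outputs = []
    for i in range(m - 1, n):
        window = text[i - m + 1 : i + 1]
        distance = sum(1 for a, b in zip(pattern, window) if a != b)
        outputs.append(distance if distance <= k else "No")
    return outputs


if __name__ == "__main__":
    print(k_mismatch_naive("ab", "abba", 1))
```

Such a brute-force routine is mainly useful as a test oracle against which faster implementations can be compared.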

The general task of efficiently computing the Hamming distances between a pattern and a longer text has
been studied since at least the 1980s when O(n√(m log m)) time solutions were first discovered IllllSl. For
many years however the fastest known algorithm for the k-mismatch problem ran in O(nk) time [24] using
repeated Lowest Common Ancestor calls to a generalised suffix tree of the pattern and text. Eventually, in
the year 2000 two improved algorithms were given which run in O(nk^3 log k/m + n) and O(n√(k log k))
time respectively [9, 10]. The former algorithm is clearly preferable when k/m is relatively small and the
latter algorithm has superior performance in all other cases. To date, these two algorithms remain
the fastest solutions known.

Our first result is a new deterministic algorithm for the k-mismatch problem which is faster than all
previous solutions when k E 0(m^/^“'^). This is a result of independent interest, providing the fastest 
known k-mismatch algorithm for a large and particularly natural range of values of the threshold k.

Theorem 1.1. Given a pattern P of length m and a text T of length n, there is a deterministic solution for
the k-mismatch problem with run-time O(nk^2 log k/m + n polylog m).

We then turn our attention to a small-space online version of the k-mismatch problem. In this setting
the text arrives one symbol at a time and we must output the Hamming distance, if it is at most k, before 
the subsequent symbol arrives. We consider a particularly strong space model where we account for all the 
space used by our algorithm and in particular we are not permitted to store a copy of the pattern or text 
without also accounting for that. We obtain the following result. 

Theorem 1.2. Given a pattern P of length m and a streaming text of total length n arriving one symbol
at a time, there is a randomised O(k^2 polylog m) space online algorithm which runs in O(nk^2 log k/m +
n polylog m) time and solves the k-mismatch problem. The probability of error is at most 1/m^2.

A particularly attractive feature of this new online algorithm is that whenever k E 0(m^/^“'^), it not 
only uses sublinear space but also has total running time of only O(n polylog m).

We next consider a small-space approximate version of the k-mismatch problem. In return for tolerating
a constant multiplicative error in the output we are able to give an algorithm that runs in polylog m time per
symbol. We define the (1 + ε)-approximate k-mismatch problem as follows. Let y be the true Hamming
distance at a particular alignment of the pattern and text. At each alignment of the pattern and text, we output
either an integer x or “No”. If we output “No” then y > k with high probability. If we output an integer x
then y ≤ x ≤ (1 + ε)y with high probability. One subtlety with this problem definition is that the two cases
overlap when k < y < (1 + ε)k. In this case we are free to either output “No” or an integer x. However any
integer we do output must still be a (1 + ε)-approximation to the true Hamming distance. This formulation
is a generalisation of the e-threshold decision problem introduced by Indyk in FOCS 1998 ifT^ where a 
linear space 0{{n/e^) log m) time offline algorithm was given. 

Theorem 1.3. Given a pattern P of length m and a streaming text arriving one symbol at a time, there
is a randomised O(k^2 polylog m/ε^2) space algorithm which takes O(polylog m/ε^2) worst-case time per
arriving symbol and solves the (1 + ε)-approximate k-mismatch problem. The probability of error is at
most 1/m^2.

Finally we turn to the streaming k-mismatch problem itself. Here the text arrives one symbol at a time,
as in the online model. However a particularly important additional feature is that the performance per 
arriving symbol should be guaranteed worst-case. The analysis of small space streaming algorithms for 
pattern matching problems started in earnest in FOCS 2009 [26]. In that year Porat and Porat presented
a randomised algorithm for performing exact matching in a stream which only stored O(log m) words
of space and required O(log m) worst-case time per arriving symbol [26]. This result was subsequently
slightly simplified ifTTl and then eventually improved to take constant time per arriving symbol in 2011 ifTTl . 

Following this early breakthrough, the natural question was to ask for what other pattern matching 
problems is it also possible to find near optimal time and space solutions. Unfortunately, it turns out that for 
a large range of the most popular pattern matching problems, including pattern matching with wildcards, L1,
L2, L∞-distance and edit distance, space proportional to the pattern length is required for any randomised
online algorithm ifTSll . Despite this, the Porat and Porat paper also presented an algorithm for the streaming 
k-mismatch problem that ran in O(k^3 polylog m) space and O(k^2 polylog m) time per arriving symbol in
their original 2009 paper. For small k this is a sublinear space algorithm and it remains to date one of the 
few fast sublinear space algorithms for streaming pattern matching that is known. 

As our final result we use a combination of Theorems 1.2 and 1.3 as the basis for a new worst-case
time streaming algorithm for the k-mismatch problem which is not only significantly faster than the result
of Porat and Porat, but whose time complexity matches (up to logarithmic factors) the fastest known offline 
algorithm. Our method also uses a multiplicative factor of k less space than the previous result of Porat and 
Porat (up to logarithmic factors again) while still guaranteeing that an output is made after each arriving 
symbol and before any future symbol is processed. 

Theorem 1.4. Given a pattern of length m and a streaming text arriving one symbol at a time, there is a
randomised O(k^2 polylog m) space algorithm which takes O(√k log k + polylog m) worst-case time per
arriving symbol and solves the k-mismatch problem. The probability of error is at most 1/m^2.

Each one of our four main results is of independent interest and advances the state of the art for their 
respective problems. However, we regard Theorems 1.1 and 1.4 to be the most significant contributions of
this paper. The main technical contributions are set out in Section 3.

2 Related work and lower bounds 

There has been great interest in time and space efficient streaming algorithms over the last 20 years,
following the seminal work of fj]. In relation specifically to pattern matching problems, where space is not
limited but where an output must be computed after every new symbol of the text arrives, the Hamming
distance between the pattern and the latest suffix of the stream can be computed online in O(√(m log m))
worst-case time per arriving symbol or O(√k log k + log m) time for the k-mismatch version ifT^ . Both
these methods however require O(m) space. Using the same approach, a number of other approximate
pattern matching algorithms have also been transformed into efficient linear space online algorithms including
HI m 121 in H m |25l. The only other small space streaming pattern matching algorithm that we are
aware of solves a problem known as parameterised matching [20]. In the offline setting, randomised and
deterministic algorithms that give a (1 + ε)-approximation to the Hamming distance are also known [21].
The running time of these two algorithms is 0((n/e^) log^ m) and 0{{n/e^) log^ m) respectively. Using 
an existing online to offline reduction [14] the (1 + ε)-approximation algorithms of [21] can be converted
into O(m/ε^2) space online solutions with guaranteed worst case running time per arriving symbol at a
multiplicative time cost of O(log m).

One can derive a space lower bound for any streaming problem by looking at a related one-way communication
complexity problem. The randomised one-way communication complexity of determining if
the Hamming distance between two n-bit strings is greater than k is known to be Ω(k) bits (with an upper
bound of O(k log k)) lITSl . From this we can derive the same lower bound for the space required by any
streaming k-mismatch algorithm. The results we present in this paper take us a significant step towards this
lower bound but it is still unclear how closely it can ultimately be approached.

3 Overview of the main ideas 

In this section we will give an overview of the main ideas needed to prove Theorems 1.1, 1.2, 1.3 and 1.4.

We start by introducing the notion of the approximate period, or x-period of a string. This idea will be
crucial for all of our main results. We will in general use the approximate period of the pattern to separate
our problems into two cases. Let Ham(P, S) be the Hamming distance between equal length strings P and
S and let Ham(P, T)[i] be Ham(P, T[i − m + 1, i]).


Definition 3.1. The x-period of a string P of length m is the smallest integer π > 0 such that
Ham(P[π, m − 1], P[0, m − 1 − π]) ≤ x. (For example, the 1-period of the string babaa is 2.)
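
The definition can be checked directly with a few lines of code. The sketch below is a naive quadratic-time computation given only for illustration; the helper names `hamming` and `x_period` are ours, and returning m when no smaller shift qualifies (the two compared strings are then empty) is a convention assumed here rather than one taken from the text.

```python
def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    assert len(a) == len(b)
    return sum(1 for u, v in zip(a, b) if u != v)


def x_period(p, x):
    """Smallest shift pi > 0 with Ham(p[pi:], p[:m - pi]) <= x (naive scan)."""
    m = len(p)
    for shift in range(1, m):
        if hamming(p[shift:], p[: m - shift]) <= x:
            return shift
    # Assumed convention: a shift of m compares two empty strings.
    return m


if __name__ == "__main__":
    # The example from the definition: shift 1 gives 3 mismatches,
    # shift 2 gives 1 mismatch, so the 1-period of babaa is 2.
    print(x_period("babaa", 1))
```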

Let ℓ be the 3k-period of the pattern P and as our first of two cases, consider when ℓ < k. We call this
the small approximate period case and as we will see, the solution for this case contains some of the main
ideas on which our other results will rely.

Fact 3.2. If a pattern has 3k-period ℓ then each (3k/2)-mismatch of the pattern and the text must be at least
ℓ symbols apart.

Small approximate period (ℓ < k) case of Theorems 1.1 and 1.2. Our solution for the small approximate
period case is the same for both our offline (see Theorem 1.1) and online small-space (see Theorem 1.2)
algorithms. The main new idea is to reduce the problem to many instances of run length encoded pattern
matching. Our solution utilises a simple variant of run length encoding and we will use this encoding to
reduce the k-mismatch problem to a total of O(k^2) small instances of the run length encoded Hamming
distance problem.

There are a number of surprising elements to our solution. The first one is that in any substring of the text
of length 2m we can find a compressible region that contains all the alignments of the pattern and text with
Hamming distance at most k. The second is that by choosing a suitable partitioning of the pattern and of
this compressible region into O(k) subpatterns and O(k) subtexts respectively and then run length encoding
those, we can ensure that the total number of runs, summed across all subpatterns and subtexts, is only O(k).
The third is that despite there being O(k) subpatterns and O(k) subtexts giving O(k^2) instances of the run
length encoded Hamming distance problem, each of which can take O(k^2 log k) time, we show that the time
complexity of all the instances sums to only O(k^2 log k). By the same approach, we will demonstrate that
the working space of all the instances sums to O(k^2). We will also need to be careful when recovering the
final Hamming distances because, in the worst case, each final distance is the sum of k outputs of the run
length encoded Hamming distance problem. A naive summation would therefore result in an additive Ω(k)
term per Hamming distance. To overcome this bottleneck we will take advantage of the compressed output
to reduce the time taken to recover the final distances to O(m + k^2 log k) per substring.

Using a standard trick we run our algorithm independently on O(n/m) substrings of the text of length
2m, each overlapping the next by m symbols, thus giving Lemma 3.3. The main steps are set out in
Algorithm 1 with additional details and a proof overview set out in Section 0
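
The standard trick mentioned above can be sketched as follows. The generator below is an illustration only; its name and the use of half-open index ranges are our assumptions. It yields windows of length at most 2m whose starting positions are m apart, so that every m-length substring of the text lies entirely inside at least one window.

```python
def overlapping_windows(n, m):
    """Yield half-open ranges (start, end) covering a text of length n.

    Consecutive windows start m symbols apart and have length 2m, except
    possibly at the end of the text where the window is truncated.
    """
    start = 0
    while start + m <= n:
        yield start, min(start + 2 * m, n)
        start += m


if __name__ == "__main__":
    n, m = 10, 3
    windows = list(overlapping_windows(n, m))
    print(windows)
    # Every alignment text[s : s + m] is contained in some window.
    for s in range(n - m + 1):
        assert any(a <= s and s + m <= b for a, b in windows)
```

Since there are at most n/m windows and each is processed independently, a per-window cost of O(m + k^2 log k) gives the O(nk^2 log k/m + n) total stated in Lemma 3.3.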


Input: Pattern of length m and text of length 2m.

1. Identify a compressible region of the text which contains all the k-mismatches.

2. Partition this region into O(k) subtexts and the pattern into O(k) subpatterns.

3. Run length encode all the subpatterns and subtexts.

4. Compute run length encoded Hamming distances for each subpattern/subtext pair.

5. Sum the Hamming distances from Step 4.


Algorithm 1: Deterministic algorithm for k-mismatch when the pattern has small approximate period.


Lemma 3.3. Consider a pattern P of length m, and a text T of length n arriving online. If the 3k-period
of P is smaller than k, then the k-mismatch pattern matching problem can be solved in O(k^2) space and
O(nk^2 log k/m + n) time.

Large approximate period (ℓ > k) case of Theorems 1.1 and 1.2. The overall structure of our solutions
for both Theorems 1.1 and 1.2 when the pattern has large approximate period is the same. We first describe
the simpler deterministic case which gives us Theorem 1.1.

1. Filter out all alignments of the pattern and text with Hamming distance greater than 3k/2. We can do
this by running Karloff’s (1 + ε)-approximation algorithm [21] with ε = 1/2, excluding all positions
which are reported to have Hamming distance greater than 3k/2. This takes O(polylog m) time per
symbol in the text.

2. Verify whether the Hamming distance is at most k at those positions. This takes O(k) time per
alignment we need to verify using O(k) repeated applications of constant time longest common prefix
(LCP) queries between the pattern and the suffix of the text starting at the current alignment [24].

We need only run the verification step at alignments that have not been filtered out by the filtering
step. By Fact 3.2 there can be no more than one such alignment for every k consecutive text symbols that
arrive, so verification costs O(1) amortised time per symbol. It follows that the total amortised time for the
large approximate period case is O(n polylog m). This
completes the algorithmic description that establishes Theorem 1.1.
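
The verification step can be illustrated by the following sketch, which jumps from mismatch to mismatch using longest common prefix queries. As a simplifying assumption, the function `lcp` below scans the two strings directly, whereas the text relies on constant time LCP queries [24]; the function names and the use of `None` in place of “No” are also ours.

```python
def lcp(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:] (naive scan)."""
    length = 0
    while i + length < len(a) and j + length < len(b) and a[i + length] == b[j + length]:
        length += 1
    return length


def verify_alignment(pattern, text, start, k):
    """Hamming distance between pattern and text[start : start + m] if it is
    at most k, otherwise None. Uses at most k + 1 LCP queries."""
    m = len(pattern)
    assert start + m <= len(text)
    position, mismatches = 0, 0
    while True:
        # Skip the maximal run of matching symbols with one LCP query.
        position += lcp(pattern, position, text, start + position)
        if position >= m:
            return mismatches
        mismatches += 1          # pattern[position] != text[start + position]
        if mismatches > k:
            return None
        position += 1


if __name__ == "__main__":
    print(verify_alignment("abcab", "xxabzabxx", 2, 1))
```

With a constant time LCP oracle each call to `verify_alignment` costs O(k), which is the bound used in the amortised analysis above.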

In order to establish Theorem 1.2 for the large approximate period case we will need small-space
versions of both the filtering and verification steps. For the filtering step we set ε = 1/2 again and this time
use Theorem 1.3 which we discuss later. In the same way as in the deterministic case, after filtering the
verification step will only need to verify at most one potential k-mismatch per k consecutive text symbols.
To do this efficiently we maintain a dynamic data structure that allows us to query the Hamming distance
between P and the latest m-length suffix of the text and will output the exact distance if it is at most k and
“No” otherwise. Each time a new symbol of the text arrives we perform an update.

Lemma 3.4. For a given pattern P of length m, and an online text T of length n there is a data structure
which answers Hamming distance queries as described above and uses O(k^2 polylog m) space, update
time O(polylog m), and query time O(k polylog m). If the Hamming distance does not exceed 2k, the
probability of error is at most 1/m^2.

The key technical innovation, which is set out in Lemma 3.4, is that our data structure takes only
polylog m time to perform an update when a new text symbol arrives if no query is performed at that
time. We will use this asymmetry in query and update times combined with Fact 3.2 to show Theorem 1.4.

Our solution for Lemma 3.4 works by first reducing the problem to repeated application of 1-mismatch,
in a similar fashion to Porat and Porat [26] and then in turn reducing the 1-mismatch problem to the streaming
dictionary matching problem. However, our method differs significantly in technique from the previous
work both by randomising the first reduction step and then in our second reduction step which allows us to
perform updates much more quickly than queries.

(1 + ε)-approximate k-mismatch - Theorem 1.3. The main new ideas for our approximation algorithm
are a novel randomised length reduction scheme and a two stage approximation scheme. The general idea
is as follows. First, during preprocessing we reduce the length of the pattern to be only O(k log^2 m). We
then overcome a particularly significant technical hurdle by showing how to transform the text in such a way
that any Hamming distance between the reduced length pattern and transformed text provides a reasonable
approximation of the corresponding Hamming distance in the original input. Finally we apply an existing
linear space online (1 + ε)-approximation algorithm to the reduced length pattern and the transformed text
to give the final approximate answer. The entire process is repeated independently in parallel a logarithmic
number of times to improve the error probability. We argue that this approximation of an approximation still
gives us a (1 + ε)-approximation to the true Hamming distance at each alignment with good probability.


Deamortisation using the tail trick - Theorem 1.4. We can now describe how to deamortise our online
k-mismatch algorithm with O(nk^2 log k/m + n polylog m) run-time that we gave for Theorem 1.2 to give
us a fast worst-case time streaming algorithm satisfying Theorem 1.4. We first observe that if the pattern
length m is at most 2k^2, we can run an existing algorithm lIT^ which will take O(√k log k) time per symbol
and uses linear space, which in this case is O(k^2). We now proceed under the assumption that m > 2k^2.

To deamortise the algorithm, we use a two part partitioning that we call the tail trick. Similar ideas
were also used to deamortise streaming pattern matching algorithms in ifTSlfT^ . We partition the pattern
into two parts: the tail P_t, which is the suffix of P of length 2k^2, and the head P_h, which is the prefix of P of
length m − 2k^2. We will compute the current Hamming distance, Ham(P, T)[i], by summing Ham(P_t, T)[i]
and Ham(P_h, T)[i − 2k^2]. To compute Ham(P_t, T)[i] we again use the existing linear space online
k-mismatch algorithm from ifT^ taking O(√k log k) time per symbol and O(k^2) space.

We also need to make sure that when the i-th symbol of the text, T[i], arrives, we will have computed
Ham(P_h, T)[i − 2k^2] in time. To this end we run the amortised algorithm from Theorem 1.2 using pattern
P_h. However, we cap the run-time at O(polylog m) per symbol. That is, when T[i] arrives we run polylog m
steps of the algorithm. Because the algorithm is amortised, it may lag behind the text stream: when T[i]
arrives, it may still be processing T[i'] for some i' < i. Fortunately, the lag cannot exceed 2k^2, that is
at all times i − i' ≤ 2k^2. This is because we are able to show that while processing any 2k^2 consecutive
text symbols the total time complexity of the algorithm, summed over those consecutive symbols, is upper
bounded by O(k^2 polylog m). To allow for the lag in the deamortisation process we also
maintain a buffer containing the most recently arrived 2k^2 text symbols and the most recent 2k^2 outputs.

The space is dominated by the algorithm from Theorem 1.2 which uses O(k^2 polylog m) space. The
time complexity is the sum of the complexities for processing P_t and P_h which is O(√k log k + polylog m)
per arriving symbol.
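
The identity behind the tail trick, namely that the distance at alignment i is the tail distance at i plus the head distance at i minus the tail length, can be checked with the short sketch below. It only illustrates the decomposition by brute force and does not implement the deamortised algorithm; the function names and the parameter `tail_len` (which the text fixes to 2k^2) are ours.

```python
def ham_at(p, t, i):
    """Ham(p, t)[i]: distance between p and the |p|-length window of t ending at i."""
    window = t[i - len(p) + 1 : i + 1]
    assert len(window) == len(p)
    return sum(1 for a, b in zip(p, window) if a != b)


def ham_via_tail_trick(p, t, i, tail_len):
    """Ham(p, t)[i] as Ham(tail, t)[i] + Ham(head, t)[i - tail_len]."""
    head, tail = p[: len(p) - tail_len], p[len(p) - tail_len :]
    return ham_at(tail, t, i) + ham_at(head, t, i - tail_len)


if __name__ == "__main__":
    p, t = "abcabcab", "abcabxabcabcabab"
    k = 1
    tail_len = 2 * k * k
    for i in range(len(p) - 1, len(t)):
        assert ham_via_tail_trick(p, t, i, tail_len) == ham_at(p, t, i)
    print("decomposition agrees at every alignment")
```

The head distance is needed only 2k^2 symbols after the corresponding window of the head has fully arrived, which is exactly the slack that absorbs the lag of the amortised algorithm.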

4 Proof of Lemma 3.4 - A data structure for k-mismatch queries

In this section we give the proof of Lemma 3.4 which explains how we can maintain a small k-mismatch
data structure that can be updated very quickly when a text symbol arrives but only computes an output at
an alignment where a k-mismatch query is performed. The updates take O(polylog m) time and the queries
take O(k polylog m) time.

The pattern and text partitioning. The dynamic data structure we present here uses a simple, cyclic
partitioning of the pattern and streaming text. The same partitioning will also be used in Sections 5 and 0
For an integer q we can partition the pattern P as follows: For each r ∈ [0, q − 1], the subpattern P^{q,r} =
P[r]P[q + r]P[2q + r] ... P[⌊(m − r − 1)/q⌋ · q + r]. That is, P^{q,r} contains exactly the positions of P
that have remainder r modulo q. The text stream can be partitioned into q substreams analogously, i.e.
T^{q,r} = T[r]T[q + r]T[2q + r] ... for each r ∈ [0, q − 1].

When T[i] arrives in the text stream we refer to the alignment of P and T[i − m + 1, i] as the current
alignment. There is also a natural notion of the current alignment of subpattern P^{q,r} with exactly one
substream T^{q,r'} for some r' ∈ [0, q − 1]. Consider the positions in P which correspond to positions in P^{q,r}.
These positions in P are aligned with |P^{q,r}| positions in T[i − m + 1, i] which in turn all occur in some
unique T^{q,r'}. In fact they exactly form the latest |P^{q,r}| length suffix of the substream T^{q,r'}. We will refer to
this alignment as the current alignment of P^{q,r} without explicitly referencing T^{q,r'}.
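
A minimal sketch of the cyclic partitioning and of the current alignment of a subpattern is given below. It works on strings held in memory rather than on a stream, and the function names are ours. The substream index r' = (i − m + 1 + r) mod q used in the code follows from the definitions above but is not stated explicitly in the text.

```python
def cyclic_part(s, q, r):
    """The symbols of s at positions congruent to r modulo q, in order."""
    return s[r::q]


def current_alignment(pattern, text_so_far, q, r):
    """Return the subpattern P^{q,r} and the suffix of the substream it is
    aligned with when the last symbol of text_so_far has just arrived."""
    m = len(pattern)
    i = len(text_so_far) - 1
    sub_pattern = cyclic_part(pattern, q, r)
    r_prime = (i - m + 1 + r) % q          # index of the unique substream
    substream = cyclic_part(text_so_far, q, r_prime)
    return sub_pattern, substream[len(substream) - len(sub_pattern):]


if __name__ == "__main__":
    P, T = "abcabca", "xxabcabcbxabzabca"
    window = T[-len(P):]
    total = 0
    for r in range(3):
        sub_p, sub_t = current_alignment(P, T, 3, r)
        total += sum(1 for a, b in zip(sub_p, sub_t) if a != b)
    # The subpattern distances add up to the distance of the full alignment.
    assert total == sum(1 for a, b in zip(P, window) if a != b)
    print(total)
```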

A randomised reduction to 1-mismatch queries. We can assume that m > ^ log^ m. Otherwise, we 
can use O(m) space and still satisfy the conditions for Lemma 3.4. In this case we maintain a data structure,
as described in ifl^ which allows us to perform Longest Common Prefixes calls between the pattern and
the latest m-length suffix of the streaming text, each taking constant time. We can see that at most (k + 1)
Longest Common Prefixes calls are needed to answer a k-mismatch query and the update time per arriving
symbol is O(log m).

We begin by giving a reduction to the 1-mismatch problem. The reduction and the algorithm from 
Section 5 will use the following technical lemma.

Lemma 4.1. If p_1, p_2 are two distinct integers in [1, m] and q is a random prime number in the interval
[| log^ m, ^ log^ m] where ^ < δ < 1, then Pr[p_1 = p_2 mod q] ≤ δ/(32k). It is always assumed, unless
otherwise stated, that “log” means log_2.

Proof We have ^ log^ m > 17. Applying Corollary 1 from liTlX we obtain that the number of primes in 
the interval log^ m, ^ log^ m] is at least 

(34-2)-fc ^ m |Qg2 ^ 22/j 

i— , 34 fc 1 2 —7 - - > — log m 

log log m) log m 0 

If p_1 = p_2 mod q, then q is a prime divisor of |p_1 − p_2|. Observe that |p_1 − p_2| ≤ m − 1 has at
most log m distinct prime divisors. Consequently, the probability that q is one of these divisors is at most

log m / ((32k/δ) log m) = δ/(32k). □

We set δ to 1 and pick log m primes independently and uniformly at random from the interval of Lemma 4.1.
These are denoted q_1, q_2, ..., q_{log m}. Each q_j gives a partitioning of P into q_j subpatterns P^{q_j,r}, and T into q_j
substreams T^{q_j,r} as described above.

At the current alignment, that is the alignment of P and T[i − m + 1, i], we say that a position in P
where a mismatch occurs is isolated under q_j if the current alignment of the subpattern P^{q_j,r} containing
that position has exactly one mismatch. We define X_i to be the number of positions in P that are isolated
mismatches between P and T[i − m + 1, i] under at least one q_j. In Lemma 4.2 below we demonstrate that
if the latest Hamming distance is small then it equals X_i with high probability.

Lemma 4.2. If Ham(P, T)[i] ≤ 2k, then Ham(P, T)[i] = X_i with probability at least 1 − 1/m^2.

Proof. Ham(P, T)[i] = X_i if and only if each mismatch is isolated under q_j for at least one j. Let
M = {x_1, x_2, ..., x_|M|} be the set of mismatches in the current alignment of P and T. Suppose that
a mismatch x_i is not isolated under q_j. It follows that x_i = x_{i'} mod q_j for some i' ≠ i. By Lemma 4.1 the
probability of this event is at most 1/(32k). Applying the union bound over the at most 2k mismatches, we
obtain that x_i is not isolated
under q_j with probability at most 1/16. Therefore, as the primes are picked independently, a mismatch x_i is
not isolated under q_j for all j with probability at most (1/16)^{log m} = 1/m^4. Applying the union bound, we
finally obtain that the probability of Ham(P, T)[i] ≠ X_i is at most 2k/m^4 ≤ 1/m^2. □

We will answer a k-mismatch query at alignment i by computing X_i. To allow us to compute X_i, we
will maintain a number of data structures that can answer 1-mismatch queries on the subpatterns. Given
a pair (q_j, r), a 1-mismatch query determines whether at the current alignment of P^{q_j,r} there is exactly
one mismatch and if so, returns its location. By Lemma 4.3 below, we can answer a 1-mismatch query in
O(polylog m) time.

Lemma 4.3. Given a pair (q_j, r), a 1-mismatch query on the current alignment of P^{q_j,r} can be answered
in O(polylog m) time. The required data structures use O(k^2 polylog m) total space and maintaining them
takes O(polylog m) time when a stream update occurs.

We defer discussion of our method for answering 1-mismatch queries until after we explain how we
use them to compute X_i. First, we perform O(k polylog m) 1-mismatch queries to find the set containing
every (q_j, r) such that subpattern P^{q_j,r} has exactly one mismatch. Second, we look through every (q_j, r) in
the set and use the position of the mismatch in P^{q_j,r} to determine the corresponding mismatching position
in P. This set of mismatching positions is very likely to contain many duplicates because each position in P
occurs in exactly one P^{q_j,r} for each q_j. Therefore, the third step is to remove any duplicates to recover X_i.
Finally we return X_i as the answer to the k-mismatch query, unless X_i > k, in which case we return “No”.
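
The three steps above can be mimicked offline as follows. This sketch is an illustration under strong simplifications: it sees the whole window at once and finds the mismatches by direct comparison, whereas the data structure of this section obtains them through 1-mismatch queries. The names `isolated_mismatch_count` and `k_mismatch_query`, and the argument `moduli` standing for the primes q_1, ..., q_{log m}, are ours.

```python
from collections import defaultdict


def isolated_mismatch_count(pattern, window, moduli):
    """X_i: number of mismatching positions that are the only mismatch of
    their subpattern for at least one modulus q in moduli."""
    mismatches = [x for x, (a, b) in enumerate(zip(pattern, window)) if a != b]
    isolated = set()                      # the set removes duplicate positions
    for q in moduli:
        by_remainder = defaultdict(list)
        for x in mismatches:
            by_remainder[x % q].append(x)
        for positions in by_remainder.values():
            if len(positions) == 1:       # subpattern with exactly one mismatch
                isolated.add(positions[0])
    return len(isolated)


def k_mismatch_query(pattern, window, moduli, k):
    """Answer of the query: X_i if it is at most k, otherwise "No"."""
    x_i = isolated_mismatch_count(pattern, window, moduli)
    return x_i if x_i <= k else "No"


if __name__ == "__main__":
    # Mismatches at positions 1 and 4 collide modulo 3 but not modulo 5.
    print(isolated_mismatch_count("aaaaaa", "abaaba", [3]))
    print(isolated_mismatch_count("aaaaaa", "abaaba", [3, 5]))
```

The example shows why several moduli are needed: a single modulus can hide mismatches that share a remainder, and Lemma 4.2 bounds the probability that this happens for every chosen prime.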

The total space is O(k^2 polylog m) and the update time is O(polylog m), both of which are dominated
by the space and maintenance time of the data structures required to support 1-mismatch queries. The time
complexity for a k-mismatch query is therefore O(k polylog m) and is dominated by the time taken to
perform O(k polylog m) 1-mismatch queries, each taking O(polylog m) time.

Proof of Lemma 4.3. We conclude this section by explaining our method for answering 1-mismatch
queries which is based on a reduction to streaming dictionary matching. Given a set of patterns D, called a
dictionary, the streaming dictionary matching problem is to find any occurrences of patterns in the dictionary
in a text stream as they occur. We will use a recent streaming dictionary matching algorithm [15] which is
randomised and uses O(|D| log m) space and takes O(log log m) time to process a stream update, i.e. the
arrival of a new symbol of T.

The dictionary that we build is based on a second level of partitioning of the subpatterns using the same
partitioning scheme but with smaller values of q. For each (first-level) subpattern P^{q_j,r} there is a set of
O(log^2 m) second-level subpatterns. From Theorem 1 in fTlX it follows that
there are at least log m/ log log m primes in the interval [log m, 3 log m] and consequently the product of all
primes in this interval is at least (log m)^{log m/ log log m} = m. For each prime number p ∈ [log m, 3 log m] and
each remainder s ∈ [0, p − 1] there is a
second-level subpattern P^{q',r'} in this set where q' = q_j · p and r' = q_j · s + r. We define the dictionary
D as the union of these sets, containing all O(k polylog m) second-level subpatterns.

Each substream T^{q_j,r} is partitioned into second-level substreams in an analogous manner. We run the
streaming dictionary matching algorithm [15] with dictionary D on each second-level substream. Maintaining
these streaming dictionary matching algorithms takes O(polylog m) time each time an update occurs.
This is because each arriving T[i] only occurs in O(log m) second-level substreams. For each substream we
use O(k polylog m) space. As there are O(k polylog m) substreams this is O(k^2 polylog m) space in total.

Let us now show that a subpattern P^{q_j,r} contains an isolated mismatch if and only if for each prime
there exists exactly one second-level subpattern that does not match. Indeed, if P^{q_j,r} contains an isolated
mismatch then the second half of the statement obviously holds. Assume now that for each prime there
exists exactly one second-level subpattern that does not match and that there are at least two mismatches
at positions 1 ≤ x < y ≤ |P^{q_j,r}| ≤ m in the current alignment of P^{q_j,r}. For every prime p the remainders of x, y
modulo p are defined by the index of the second-level subpattern they belong to (i.e. the unique subpattern
that does not match) and therefore are equal. As the product of the primes p is at least m, by the Chinese
Remainder Theorem we have x = y, a contradiction.

Therefore, to answer a 1-mismatch query on P^{q_j,r} it suffices to determine which of the second-level
subpatterns of P^{q_j,r} do not match, or, equivalently, match exactly at the latest alignment. With the help of
the dictionary pattern matching algorithm we can find all second-level subpatterns P^{q',r'} that do not match
in O(polylog m) time. If for each prime there is exactly one second-level subpattern that does not match,
we can find the position of the mismatch in P^{q_j,r} in O(polylog m) time as explained above. □
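
The last step of the proof, recovering the position of the single mismatch from its remainders modulo the primes in [log m, 3 log m], can be sketched with the Chinese Remainder Theorem as below. In the data structure the remainders are given by the indices of the non-matching second-level subpatterns; in this illustration they are simply computed from a known position. The function names are ours and the code assumes Python 3.8 or later for `pow(a, -1, p)`, `math.isqrt` and `math.prod`.

```python
import math


def primes_between(lo, hi):
    """All primes p with lo <= p <= hi, found by trial division."""
    return [p for p in range(max(2, math.ceil(lo)), math.floor(hi) + 1)
            if all(p % d for d in range(2, math.isqrt(p) + 1))]


def recover_position(residues):
    """Unique x in [0, product of primes) with x = r (mod p) for every
    (p, r) in residues, built up one prime at a time."""
    x, modulus = 0, 1
    for p, r in residues.items():
        inverse = pow(modulus, -1, p)             # modulus^{-1} mod p
        x += modulus * (((r - x) * inverse) % p)
        modulus *= p
    return x


def locate_single_mismatch(x, m):
    """Simulate the recovery of a mismatch position x < m from its remainders."""
    primes = primes_between(math.log2(m), 3 * math.log2(m))
    assert math.prod(primes) >= m                 # needed for uniqueness
    residues = {p: x % p for p in primes}
    return recover_position(residues)


if __name__ == "__main__":
    print(locate_single_mismatch(777, 1000))
```

Because the product of the primes is at least m, two distinct positions below m cannot share all their remainders, which is the uniqueness argument used in the proof.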


5 Proof of Theorem 1.3 - A small space (1 + ε)-approximation

In this section we give our (1 + ε)-approximation for the streaming k-mismatch problem. If ε < 1/(2k),
we can just run the (1 + 1/(2k))-approximate algorithm. This only improves the time and space, but does
not change the output as the (1 + 1/(2k))-approximate algorithm exactly solves the k-mismatch problem
and therefore by the definition gives a (1 + ε)-approximation. Below we assume ε ≥ 1/(2k). We will also
assume that m > ^ log^ m, otherwise O(m/ε^2) space will satisfy the conditions for Theorem 1.3 and we
can simply apply the online version of Karloff’s (1 + ε)-approximate algorithm ITdl .

Our algorithm, A_Approx, will use the same partitioning of P and T into subpatterns P^{q,r} and substreams
T^{q,r} as in Section 4. As before we will perform this partitioning for O(log m) values of q. However
in contrast to Section 4 the range from which the primes are chosen will also depend on ε. Specifically,
q_1, q_2, ..., q_{log m} are picked independently and uniformly at random from the primes in the range
[| log^ m, ^ log^ m] where we set δ = |. The subpatterns and substreams for q_j then are given by P^{q_j,r}
and T^{q_j,r} for each r ∈ [0, q_j − 1].

In Section 4 we saw that for an arbitrary text substring $T[i - m + 1, i]$ we can find the Hamming distance
between $T[i - m + 1, i]$ and $P$ (if it is small) by finding every subpattern that has exactly one mismatch.
We will now see that to approximate the Hamming distance it suffices to count the number of subpatterns
$P^{q_j,r}$ that do not match exactly. For some alignment $i$, let $\mu_{i,j}$ denote the number of subpatterns $P^{q_j,r}$ that
do not match exactly and let $\mu_i = \max_j \mu_{i,j}$. Lemma 5.1 tells us that if the Hamming distance is small
then $\mu_i$ is a good approximation of the true Hamming distance. As intuition for the proof techniques, first
observe that $\mu_{i,j}$ is always upper-bounded by the true Hamming distance. The value of $\mu_{i,j}$ underestimates
the Hamming distance whenever two mismatches in $P$ belong to the same subpattern $P^{q_j,r}$. Fortunately,
when the Hamming distance is relatively small, it is likely that for at least one prime $q_j$ the effect of these
collisions will be small. Lemma 5.2 shows that if $\mathrm{Ham}(P, T)[i]$ is big, then $\mu_i$ is big with high probability.
We will consider $\delta$ to be an arbitrary value between $1/(6k)$ and $1/3$.
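The relationship between the count of non-matching subpatterns and the true Hamming distance can be seen on a toy instance. The sketch below assumes the partition $S^{q,r} = S[r]S[q+r]S[2q+r]\ldots$ and uses two hypothetical small values of $q$ in place of the randomly chosen primes; the function names are illustrative only.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def subpattern(s, q, r):
    """S^{q,r} = S[r] S[q+r] S[2q+r] ... up until the end of S."""
    return s[r::q]

def mismatching_subpatterns(pattern, window, q):
    """Number of residues r for which the subpatterns of pattern and window differ."""
    return sum(subpattern(pattern, q, r) != subpattern(window, q, r)
               for r in range(q))

pattern = "abcabcabcabc"
window = "abcxbcabcaxc"   # mismatches at positions 3 and 10

print(hamming(pattern, window))                      # prints 2
print(mismatching_subpatterns(pattern, window, 5))   # prints 2 (3 and 10 are isolated)
print(mismatching_subpatterns(pattern, window, 7))   # prints 1 (3 and 10 collide mod 7)
```

Under $q = 7$ both mismatches fall into the same subpattern, so the count underestimates the Hamming distance, whereas taking the maximum over both values of $q$ recovers it.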

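Lemma 5.1. If $\mathrm{Ham}(P, T)[i] \le 2k$, then $(1 - \delta) \cdot \mathrm{Ham}(P, T)[i] \le \mu_i \le \mathrm{Ham}(P, T)[i]$ with
probability at least $1 - 1/(4m^2)$.

Proof. By definition, $\mu_i \le \mathrm{Ham}(P, T)[i]$ with probability 1. Recall that $\mu_i = \max_j \mu_{i,j}$, where $\mu_{i,j}$ is
the number of subpatterns $P^{q_j,r}$ that do not match. The number of such subpatterns is at least the number
of mismatches isolated under $q_j$. Suppose that $\mu_i < (1 - \delta) \cdot \mathrm{Ham}(P, T)[i]$. Then the number of mismatches
isolated under $q_j$ is smaller than $(1 - \delta) \cdot \mathrm{Ham}(P, T)[i]$ for all $j$. It implies
that the number of mismatches that are not isolated under $q_j$ is at least $\delta \cdot \mathrm{Ham}(P, T)[i]$. On the
other hand, the expected number of mismatches that are not isolated under $q_j$ is at most
$\frac{\delta}{16} \cdot \mathrm{Ham}(P, T)[i]$ by Lemma 4.1. By Markov's inequality, the probability that at least
$\delta \cdot \mathrm{Ham}(P, T)[i]$ mismatches are not isolated under $q_j$ is at most $1/16$. As this must hold for all $j$ and the
primes are chosen independently, the probability of $\mu_i < (1 - \delta) \cdot \mathrm{Ham}(P, T)[i]$ is at
most $(1/16)^{\log m} = 1/m^4 \le 1/(4m^2)$. □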

We now show that if the Hamming distance is big, then $\mu_i$ is big with high probability.

Lemma 5.2. If $\mathrm{Ham}(P, T)[i] \ge 2k$ then $\mu_i > (1 + \delta) \cdot k$ with probability at least $1 - 1/(4m^2)$.

Proof. Suppose that $\mathrm{Ham}(P, T)[i] \ge 2k$ and choose a subset $\mathcal{M}$ of any $2k$ mismatches between $P$ and
$T[i - m + 1, i]$. Remember that $\mu_i$ is the maximum number of subpatterns that do not match in a partition
for the current alignment. We say that a mismatch $x$ is $\mathcal{M}$-isolated under $q_j$ if it is the only mismatch from
$\mathcal{M}$ that occurs in the current alignment of some subpattern $P^{q_j,r}$. If $\mu_i \le (1 + \delta) \cdot k \le \frac{4}{3}k$, then for all $j$
there are at most $\frac{4}{3}k$ subpatterns that do not match, and consequently there are at most $\frac{4}{3}k$ mismatches that
are $\mathcal{M}$-isolated under $q_j$.

Assume that each mismatch $x \in \mathcal{M}$ is $\mathcal{M}$-isolated for more than $\frac{2}{3} \log m$ of the chosen primes. By
summing over all mismatches in $\mathcal{M}$, we have that $\sum_j \mu_{i,j} > \frac{4}{3} k \log m$, a contradiction. Consequently,
there is at least one mismatch $x \in \mathcal{M}$ that is not $\mathcal{M}$-isolated for at least $\frac{1}{3} \log m$ of the primes.
By Lemma 4.1 and the union bound the probability that a mismatch $x$ is not $\mathcal{M}$-isolated under $q_j$ is at
most 5/16. So, the probability of Ham(P, T)[i] > 2k is at most (5/16)!^°®"^ < □ 

As alluded to in Section 3, algorithm $A_{\mathrm{Approx}}$ performs two main phases. The first phase creates a set of
$2 \log m$ length-reduced versions of the pattern during preprocessing and then performs a series of transformations
on the text as it arrives. There are two reduced patterns and two transformed texts for each of the
$O(\log m)$ values of $q_j$. The second phase then approximates the Hamming distance between each of the
reduced-length patterns and the transformed texts. We will see that, when combined, these Hamming distances
are a good approximation of $\mu_i$, which is in turn a good approximation of the true Hamming distance.


First phase. During the first phase, for each $q_j$ we perform a length reduction on $P$ by constructing
two new patterns, $\phi_1^{q_j}$ and $\phi_2^{q_j}$, each of length $O(\frac{k}{\delta} \log^2 m)$. To this end, we first compute an identifier¹,
denoted $\phi(P^{q_j,r})$, for each subpattern $P^{q_j,r}$ such that $\phi(P^{q_j,r})$ has $O(\log m)$ bits and with high probability
$\phi(P^{q_j,r}) = \phi(P^{q_j,r'})$ if and only if $P^{q_j,r} = P^{q_j,r'}$. For each $q_j$, either all the subpatterns have the
same length or there exists an $s_j$ such that the subpatterns $P^{q_j,0}, \ldots, P^{q_j,q_j-s_j-1}$ have equal lengths and
the subpatterns $P^{q_j,q_j-s_j}, \ldots, P^{q_j,q_j-1}$ have length exactly one less. If the subpatterns do have two
different lengths, the two new patterns for prime $q_j$ are then given by
$\phi_1^{q_j} = \phi(P^{q_j,0}) \ldots \phi(P^{q_j,q_j-s_j-1})$
and $\phi_2^{q_j} = \phi(P^{q_j,q_j-s_j}) \ldots \phi(P^{q_j,q_j-1})$. We will proceed assuming that not all the subpatterns have the
same length, as if they do we can simply omit the parts of the algorithm that would otherwise use the second
pattern.

We transform the text as it arrives to form two new streams, $C_1^{q_j}$ and $C_2^{q_j}$, for each $q_j$. To produce these
new streams, for each substream $T^{q_j,r}$ we run two instances of a dictionary matching algorithm [13], one
on dictionary $D_1 = \{P^{q_j,0}, \ldots, P^{q_j,q_j-s_j-1}\}$ and one on $D_2 = \{P^{q_j,q_j-s_j}, \ldots, P^{q_j,q_j-1}\}$. For the latest
alignment in the substream, each dictionary matching instance returns the identifier of a subpattern
from its dictionary ($D_1$ or $D_2$) that currently matches (if there is one)². Both instances use $O(q_j \log m)$
space and $O(\log \log m)$ time per position and are correct with high probability.

We use the output of the dictionary matching to form the streams $C_1^{q_j}$ and $C_2^{q_j}$ for each $q_j$. When a
new symbol in $T$ arrives, we will append one symbol to $C_1^{q_j}$ and one to $C_2^{q_j}$. The arrival of a new symbol
in $T$ corresponds to a new symbol in one substream $T^{q_j,r}$ for each $q_j$. If we find a new match of a pattern
from $D_1$ in $T^{q_j,r}$, we append its identifier to $C_1^{q_j}$. Otherwise, we append the symbol \$ to $C_1^{q_j}$. Analogously for $D_2$, if we
find a match of a pattern from $D_2$, we append its identifier to $C_2^{q_j}$, and otherwise we append \$. This allows
us to compute $\mu_{i,j}$ at alignment $i$ as formalised by the following fact.

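Fact 5.3. For any alignment $i$ and $q_j$, we have that $\mu_{i,j} = \mathrm{Ham}(\phi_1^{q_j}, C_1^{q_j})[i - s_j] + \mathrm{Ham}(\phi_2^{q_j}, C_2^{q_j})[i]$.

Proof. By definition, $\mathrm{Ham}(\phi_1^{q_j}, C_1^{q_j})[i - s_j]$ equals the number of subpatterns from $P^{q_j,0}, \ldots, P^{q_j,q_j-s_j-1}$
that do not match at the current alignment, while $\mathrm{Ham}(\phi_2^{q_j}, C_2^{q_j})[i]$ equals the number of subpatterns among
$P^{q_j,q_j-s_j}, \ldots, P^{q_j,q_j-1}$ that do not match. □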


Second phase. The second phase approximates the values of $\mathrm{Ham}(\phi_1^{q_j}, C_1^{q_j})[i - s_j]$ and $\mathrm{Ham}(\phi_2^{q_j}, C_2^{q_j})[i]$
for each $q_j$ as the stream arrives. We compute these approximate Hamming distances using an online
variant [14] of Karloff's $(1 + \delta)$-approximate pattern matching algorithm 1^ . Karloff's algorithm requires
$\delta$ to be bigger than the reciprocal of the pattern's length. This condition is satisfied as


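$$\delta \ge \frac{1}{6k} > \frac{1}{3k \log^2 m} \ge \frac{1}{\frac{k}{\delta} \log^2 m} \ge \max\left\{ \frac{1}{|\phi_1^{q_j}|}, \frac{1}{|\phi_2^{q_j}|} \right\}.$$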


¹ For example, Karp-Rabin fingerprints [23] meet these requirements.

² The streaming dictionary matching algorithm from [13] can easily be modified to return such an identifier.
The algorithm takes 0{^ log^m) space and time per output. We run two instances of the
algorithm for each $q_j$, one on the stream $C_1^{q_j}$ and the pattern $\phi_1^{q_j}$, and the other on the stream $C_2^{q_j}$ and the pattern $\phi_2^{q_j}$.
For the first algorithm, we store the last $s_j < q_j$ outputs in a cyclic buffer. We can then compute $\tilde{\mu}_{i,j}$, the
sum of the approximate values of $\mathrm{Ham}(\phi_1^{q_j}, C_1^{q_j})[i - s_j]$ and $\mathrm{Ham}(\phi_2^{q_j}, C_2^{q_j})[i]$, in $O(1)$ time per output.

The maximum of the $\tilde{\mu}_{i,j}$ outputs over all $j$ is an integer $\tilde{\mu}_i \in [\mu_i, (1 + \delta) \cdot \mu_i]$ which can be computed
in $O(\log m)$ time per position. The algorithm returns "No" if $\tilde{\mu}_i > (1 + \delta) \cdot k$ and $\lfloor \tilde{\mu}_i/(1 - \delta) \rfloor$ otherwise.
The claim of correctness is given in Lemma 5.4.
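The output rule just described can be sketched as follows. This is only an illustration of the final thresholding step, assuming the approximate count $\tilde{\mu}_i$ has already been computed; the function name `approx_output` is hypothetical, and exact rational arithmetic is used so that the floor is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import floor

def approx_output(mu_tilde, k, delta):
    """Return "No" if mu_tilde exceeds (1 + delta) * k, else floor(mu_tilde / (1 - delta))."""
    if mu_tilde > (1 + delta) * k:
        return "No"
    return floor(mu_tilde / (1 - delta))

delta = Fraction(1, 10)              # hypothetical value of delta = epsilon / 3
print(approx_output(9, 10, delta))   # prints 10
print(approx_output(12, 10, delta))  # prints No
```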

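Lemma 5.4. For all $\frac{1}{2k} \le \varepsilon \le 1$, if $\tilde{\mu}_i > (1 + \frac{\varepsilon}{3}) \cdot k$ then $\mathrm{Ham}(P, T)[i] > k$; otherwise, $\lfloor \tilde{\mu}_i/(1 - \frac{\varepsilon}{3}) \rfloor$ is a
$(1 + \varepsilon)$-approximation of $\mathrm{Ham}(P, T)[i]$. The error probability is at most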

Proof. We use Karp-Rabin fingerprints [23] as identifiers of the subpatterns. The probability that identifiers
of two equal-length subpatterns are equal can be made as small as 1/n^ by choosing a sufficiently large
prime. It implies that the probability of computing $\tilde{\mu}_i$ incorrectly is at most (34fc/<5)^iog m ^

Assume that $\tilde{\mu}_i$ is computed correctly. If $\tilde{\mu}_i > (1 + \delta) \cdot k$, then $\mathrm{Ham}(P, T)[i] \ge \mu_i \ge \tilde{\mu}_i/(1 + \delta) > k$.

Otherwise, $\mu_i \le \tilde{\mu}_i \le (1 + \delta) \cdot k$, and from Lemma 5.2 we obtain that $\mathrm{Ham}(P, T)[i] < 2k$ with probability
at least $1 - 1/(4m^2)$. Finally, Lemma 5.1 also implies that $\mathrm{Ham}(P, T)[i] \le \mu_i/(1 - \delta) \le \tilde{\mu}_i/(1 - \delta)$ and
$\tilde{\mu}_i/(1 - \delta) \le \frac{1+\delta}{1-\delta} \cdot \mu_i \le (1 + \varepsilon) \cdot \mu_i \le (1 + \varepsilon) \cdot \mathrm{Ham}(P, T)[i]$ with probability at least $1 - 1/(4m^2)$. The
output is the integer $\lfloor \tilde{\mu}_i/(1 - \delta) \rfloor \le \tilde{\mu}_i/(1 - \delta) \le (1 + \varepsilon) \cdot \mathrm{Ham}(P, T)[i]$. As $\tilde{\mu}_i/(1 - \delta) \ge \mathrm{Ham}(P, T)[i]$
and $\mathrm{Ham}(P, T)[i]$ is an integer, we have that $\lfloor \tilde{\mu}_i/(1 - \delta) \rfloor \ge \mathrm{Ham}(P, T)[i]$. The claim follows. □

Time and space complexities. It suffices to estimate the overall time and space complexities for the
case where $\varepsilon \ge 1/(2k)$, as for the smaller values of $\varepsilon$ we run a $(1 + 1/(2k))$-approximate algorithm. For
one prime and one substream, the dictionary pattern matching algorithm uses $O((k/\delta) \log^3 m)$ space as the
dictionary will contain $O((k/\delta) \log^2 m)$ subpatterns. In total, all the dictionary pattern matching algorithms
combined use $O((k^2/\delta^2) \log^6 m) = O((k^2/\varepsilon^2) \log^6 m)$ space as we have $O(\log m)$ primes for each of the
$O((k/\delta) \log^2 m)$ substreams. We also require $O((k/\varepsilon^2) \log^6 m)$ space to run all $O(\log m)$ copies of the
online version of Karloff's $(1 + \delta)$-approximation algorithm. This is because each subpattern is of length
$O((k/\varepsilon) \log^2 m)$ (recall that $\delta = \varepsilon/3$). Despite this, the overall space complexity is not affected by running
Karloff's algorithm. This is because if $\varepsilon \ge 1/(2k)$ then the space is dominated by $O((k^2/\varepsilon^2) \log^6 m)$.

Each symbol of $T$ is added to only one of the substreams $T^{q_j,r}$ for each $j$. For each of them we
update the dictionary matching algorithms, which takes $O(\log m \log \log m)$ time. Next, for each of the
$O(\log m)$ updated streams we give one output of the online version of Karloff's algorithm, which takes
$O(\log^6 m/\delta^2) = O(\log^6 m/\varepsilon^2)$ time in total. This completes the proof of Theorem 1.3.

6 Proof of Lemma 3.3 - The small approximate period case

We now give a proof of Lemma 3.3, which states that if the $3k$-period of $P$ is smaller than $k$, then the
$k$-mismatch pattern matching problem can be solved in $O(k^2)$ space and $O(nk^2 \log k/m + n)$ time. The
discussion follows with reference to the steps of Algorithm 1, which is given in Section 3.

Our algorithm utilises a simple variant of run length encoding. We will use this encoding to reduce the
$k$-mismatch problem to a total of $O(k^2)$ small instances of the run length encoded Hamming distance problem.
Each instance will process a pattern/text pair each containing $O(k)$ runs. By using a streaming variant of an
existing run length encoded Hamming distance algorithm, we will be able to output the Hamming distances
for each of these instances in a compressed format in a total of $O(k^2 \log k)$ time. The original Hamming
distances can then be recovered in a streaming fashion by summing the outputs of the run length encoded
instances.
Run length encoding using the $3k$-period. We begin by describing the variant of run length encoding that
we will use and argue that all the information about the pattern and text that we need to answer $k$-mismatch
queries can be encoded in $O(k)$ space. Let $\ell \le k$ be the $3k$-period of $P$. We partition the pattern and the text
as described in Section 4, except that instead of choosing a random prime, we use the fixed value $\ell$.
Recall that for an arbitrary string $S$, the partition $S^{\ell,r}$ is defined to be equal to $S[r]S[\ell + r]S[2\ell + r] \ldots$ up until
the end of $S$. As $\ell$ is fixed for this section, we will shorten the notation to $S^r$ instead. The $\ell$-run length
encoding of a string $S$ is defined as the ordered set of all $S^r$, each stored in run length encoded form, where
$r \in [0, \ell - 1]$. We denote by $\mathrm{runs}(S^r)$ the number of runs in $S^r$. The size of the encoding, denoted $\mathrm{runs}_\ell(S)$,
is $\sum_{r=0}^{\ell-1} \mathrm{runs}(S^r)$. We begin with an example of the encoding. The whitespace in $P$ in the example has only
been included for visual clarity.

Example 6.1. Let $P$ = aab aab aab aab aab aab aac and $k = 4$. The $3k$-period of $P$ is $\ell = 3$. We then
have that $P^0$ = aaaaaaa, $P^1$ = aaaaaaa, $P^2$ = bbbbbbc. The $\ell$-run length encoding of $P$ is: the run
length encoding $(a, 7)$ of $P^0$, the run length encoding $(a, 7)$ of $P^1$, and the run length encoding $(b, 6)(c, 1)$
of $P^2$. The size of the encoding is $\mathrm{runs}_\ell(P) = 1 + 1 + 2 = 4$.
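The encoding in this example can be reproduced with a short sketch. The function names below are hypothetical, and the representation (one list of (symbol, run length) pairs per residue $r$) is one reasonable reading of the definition rather than the authors' data structure.

```python
from itertools import groupby

def run_length_encode(s):
    """Standard run length encoding as a list of (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]

def ell_rle(s, ell):
    """ell-run length encoding: the run length encoding of each S^r = S[r] S[ell+r] ..."""
    return [run_length_encode(s[r::ell]) for r in range(ell)]

def runs_ell(s, ell):
    """Size of the ell-run length encoding, i.e. the total number of runs."""
    return sum(len(encoding) for encoding in ell_rle(s, ell))

P = "aab" * 6 + "aac"   # the pattern of the example, without whitespace
print(ell_rle(P, 3))    # prints [[('a', 7)], [('a', 7)], [('b', 6), ('c', 1)]]
print(runs_ell(P, 3))   # prints 4
```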

Our first observation is that for a pattern with small approximate period, its $\ell$-run length encoding is also
small. Intuitively this is because a pattern with small approximate period almost repeats every $\ell$ symbols.

Lemma 6.2. If $P$ has $3k$-period at most $k$ then $\mathrm{runs}_\ell(P) \le 4k$.

Proof. We have that $\mathrm{Ham}(P[\ell, m - 1], P[0, m - 1 - \ell]) \le 3k$. Let $h = \mathrm{Ham}(P[\ell, m - 1], P[0, m - 1 - \ell])$
and let $\mathcal{L} = \{i_1, i_2, \ldots, i_h\}$ be the set of locations of the mismatches in $P[\ell, m - 1]$, that is, the positions $i$ with
$P[i - \ell] \ne P[i]$. For all $i \in [\ell, m - 1] \setminus \mathcal{L}$
we have that $P[i - \ell] = P[i]$. Furthermore let $\mathcal{L}_r$ be the subset of $\mathcal{L}$ containing indices $\{i \in \mathcal{L} \mid i \equiv r \bmod \ell\}$.
Observe that for $r, r' \in [0, \ell - 1]$ with $r \ne r'$, we have that $\mathcal{L}_r$ and $\mathcal{L}_{r'}$ are disjoint. Recall that $P[i - \ell] =
P[i]$ for all $i \in [\ell, m - 1] \setminus \mathcal{L}$. If we rephrase this in terms of $P^r$, we have that $P^r[q - 1] = P^r[q]$ if
$(q\ell + r) \in [\ell, m - 1] \setminus \mathcal{L}_r$. Since the number of runs in $P^r$ is equal to the number of non-equal neighbouring
symbols plus one, the number of runs in $P^r$ is at most $|\mathcal{L}_r| + 1$. By summing over all $r$, we have that
$\mathrm{runs}_\ell(P) \le 3k + \ell \le 4k$. □
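The counting argument in this proof can be checked numerically. The sketch below uses hypothetical function names and compares the number of runs in the $\ell$-run length encoding with the number $h$ of positions $i$ where $P[i - \ell] \ne P[i]$. Under these definitions, and assuming $m \ge \ell$ so that every $P^r$ is non-empty, the two quantities differ by exactly $\ell$, which gives the bound $h + \ell$ used in the lemma.

```python
def shifted_mismatches(p, ell):
    """h = Ham(P[ell, m-1], P[0, m-1-ell]): positions i with P[i-ell] != P[i]."""
    return sum(p[i] != p[i - ell] for i in range(ell, len(p)))

def runs(s):
    """Number of runs: non-equal neighbouring pairs plus one (zero if s is empty)."""
    return 1 + sum(s[i] != s[i - 1] for i in range(1, len(s))) if s else 0

def runs_ell(p, ell):
    """Total number of runs over all partitions P^r = P[r] P[ell+r] ..."""
    return sum(runs(p[r::ell]) for r in range(ell))

P = "aab" * 6 + "aac"
ell = 3
h = shifted_mismatches(P, ell)
print(h, runs_ell(P, ell))   # prints 1 4
```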

The second observation is that there is a substring of $T$, which we call $T^*$, which compresses well and
contains every alignment with at most $k$ mismatches with the pattern. Intuitively this substring compresses
well because it is very similar to the pattern, which in turn compresses well. Let us define $T_L$ to be the
longest suffix of $T[0, m - 1]$ for which $\mathrm{runs}_\ell(T_L) \le 5k$ and $T_R$ to be the longest prefix of $T[m, 2m - 1]$ for
which $\mathrm{runs}_\ell(T_R) \le 5k$. We define $T^* = T_L T_R$. It follows directly that $\mathrm{runs}_\ell(T^*) \le 10k$.

Lemma 6.3. $T^*$ completely contains every $T[i - m + 1, i]$ such that $\mathrm{Ham}(P, T)[i] \le k$.

Proof. Let $i_L$ be the smallest integer such that $\mathrm{Ham}(P, T)[i_L + m - 1] \le k$ and let $i_R$ be the largest integer
such that $\mathrm{Ham}(P, T)[i_R] \le k$. Obviously, $T[i_L, i_R]$ completely contains every $T[i - m + 1, i]$ such that
$\mathrm{Ham}(P, T)[i] \le k$.

To show that $T^*$ contains $T[i_L, i_R]$ it suffices to show that the $\ell$-run length encodings of $T[i_L, m - 1]$ and
$T[m, i_R]$ have size at most $5k$. To see that $\mathrm{runs}_\ell(T[i_L, m - 1]) \le 5k$, consider alignment $i_L + m - 1$. As
$\mathrm{Ham}(P, T)[i_L + m - 1] \le k$ and $m - 1 \le i_L + m - 1$, we have that $P$ differs from $T[i_L, i_L + m - 1]$ in
at most $k$ positions. However, we have just shown that $\mathrm{runs}_\ell(P) \le 4k$. Consider the run length encoding
of $P^r$ and the encoding of $T^r$. If there is a run in the encoding of $T^r$ which ends at some $T[i_L + j]$ but
there is no run ending at $P[j]$, then this must be the position of a mismatch. Therefore the number of
these additional runs is at most $k$. Furthermore, we have that $P[j]$ is such that $j \equiv r \bmod \ell$. Therefore
the mismatch $P[j]$ cannot cause an additional run in any $T^{r'}$ with $r' \ne r$. We therefore have that by
summing over all $r$, the total number of runs, $\mathrm{runs}_\ell(T[i_L, i_L + m - 1])$, is at most $\mathrm{runs}_\ell(P) + k \le 5k$.
Finally we observe that the encoding of a prefix is no larger than the encoding of the original. That is,
$\mathrm{runs}_\ell(T[i_L, m - 1]) \le \mathrm{runs}_\ell(T[i_L, i_L + m - 1]) \le 5k$. An analogous argument allows us to prove that
$\mathrm{runs}_\ell(T[m, i_R]) \le 5k$. □

Run length encoded Hamming distance. Before we explain the full algorithm in more detail, we first
introduce the algorithm $A_{\mathrm{RLE}}$. The algorithm $A_{\mathrm{RLE}}$ is a straightforward adaptation of the offline algorithm of
Chen et al. [12], which computes Hamming distances between run length encoded text and pattern, to the
streaming setting.

We briefly explain the overall approach of Chen et al.'s algorithm [12]. Consider a text $T'$ and a pattern
$P'$, both in run length encoded form. Let $D$ be an $m \times n$ matrix where $D[i, j]$ equals one if $P'[j] \ne T'[i]$
and equals zero otherwise. The Hamming distance between $P'$ and $T'[i - m + 1, i]$ is exactly the sum of
the entries along the $i$-th diagonal of $D$. The $i$-th diagonal is the one which intersects cells $D[i - m +
1, 0]$ and $D[i, m - 1]$. The first observation that Chen et al. make is that the matrix $D$ can be decomposed
into $O(\mathrm{runs}(P') \cdot \mathrm{runs}(T'))$ monochromatic rectangles. These rectangles are exactly given by dividing $D$
horizontally whenever $P'[j] \ne P'[j - 1]$ and vertically whenever $T'[i] \ne T'[i - 1]$. For $1 \le i < |P'|$, they
define $\Delta[i]$ to be the difference between the Hamming distance at alignments $i$ and $(i - 1)$. Formally,

$$\Delta[i] = \mathrm{Ham}(P', T')[i] - \mathrm{Ham}(P', T')[i - 1].$$

Further they observe that if the $i$-th diagonal does not intersect any corners then $\Delta[i] = \Delta[i - 1]$. In an
offline setting, the values of $\Delta[i]$ such that $\Delta[i] \ne \Delta[i - 1]$ (and hence the values of $\mathrm{Ham}(P', T')[i]$) can
be found by sorting these corners and processing them in the order that they intersect the $i$-th diagonal as $i$
increases.
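The observation that $\Delta$ changes only rarely can be illustrated by brute force. The sketch below is not the algorithm of Chen et al.; it works on uncompressed strings, uses hypothetical function names, and only considers alignments where the pattern lies fully inside the text. It counts the alignments $i$ with $\Delta[i] \ne \Delta[i - 1]$ and compares this count against a corner-style bound of $(\mathrm{runs}(P') + 1) \cdot (\mathrm{runs}(T') - 1)$, where the extra row boundaries account for the top and bottom edges of the matrix.

```python
def runs(s):
    """Number of runs in s (zero for the empty string)."""
    return 1 + sum(s[i] != s[i - 1] for i in range(1, len(s))) if s else 0

def hamming_profile(p, t):
    """Map each full alignment i (end position in t) to Ham(p, t)[i]."""
    m = len(p)
    return {i: sum(p[j] != t[i - m + 1 + j] for j in range(m))
            for i in range(m - 1, len(t))}

def delta_change_points(p, t):
    """Alignments i at which Delta[i] differs from Delta[i - 1]."""
    m, ham = len(p), hamming_profile(p, t)
    delta = {i: ham[i] - ham[i - 1] for i in range(m, len(t))}
    return [i for i in range(m + 1, len(t)) if delta[i] != delta[i - 1]]

p, t = "aabb", "aaaabbbbaaaa"
changes = delta_change_points(p, t)
bound = (runs(p) + 1) * (runs(t) - 1)
print(len(changes), "<=", bound)
```

Between consecutive change points the Hamming distance moves linearly, which is why it is enough to store $\Delta$ and the next change point rather than every individual distance.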

We begin by briefly explaining how the input and output have been adapted for our streaming setting.
The $A_{\mathrm{RLE}}$ algorithm consists of two alternating operations, NewRun$(i, \sigma)$ and Diff$(i)$. The input to $A_{\mathrm{RLE}}$ is
supplied via the NewRun$(i, \sigma)$ operation, which informs algorithm $A_{\mathrm{RLE}}$ that a new run starts at $T'[i] = \sigma$.
Each NewRun$(i, \sigma)$ operation triggers a Diff$(i)$ operation.

Operation Diff$(i)$ produces an output of the algorithm. Diff$(i)$ returns three values: a pair $(\Delta[i], i^*)$,
where $i < i^*$, and $\mathrm{Ham}(P', T')[i]$. The next Diff operation will be called at the next NewRun operation or
at $T'[i^*]$, whichever comes first. It is guaranteed that if no NewRun occurs during $T'[i, i^*]$ then $\Delta[i] =
\Delta[i + 1] = \ldots = \Delta[i^* - 1]$.

We now explain how the operations NewRun and Diff are supported. We maintain a diagonal line
which moves from left to right as NewRun and Diff operations occur. When either NewRun$(i, \sigma)$ or
Diff$(i)$ is performed, the diagonal line moves forward to the $i$-th diagonal. Any corners of rectangles in $D$
that are crossed by the movement of the line are processed in order. This is achieved using a priority queue
containing currently unprocessed corners (sorted by the order that the corners intersect the $i$-th diagonal).
As all points which are to the left of or are currently on the $i$-th diagonal have been processed by the
end of Diff$(i)$, both $\Delta[i]$ and $\mathrm{Ham}(P', T')[i]$ can be outputted by following the approach of Chen et al.
Following the discussion above, any NewRun operation corresponds to a new vertical line in $D$. This
introduces $O(\mathrm{runs}(P'))$ rectangles and hence $O(\mathrm{runs}(P'))$ new corners. These points are pushed into the
priority queue when the NewRun operation occurs. Finally, for any Diff$(i)$ operation we also need to output $i^*$,
where $i^* > i$ is the smallest integer such that there is a corner currently in the priority queue which intersects
diagonal $i^*$. We can find this value with the help of the priority queue. Observe that the number of distinct $i^*$
outputted by the algorithm over all Diff$(i)$ operations is upper-bounded by the number of corners, which
is $O(\mathrm{runs}(P') \cdot \mathrm{runs}(T'))$. This property is required when we use the algorithm to limit the number of
Diff$(i)$ operations required. We now summarise the space and time complexities of the $A_{\mathrm{RLE}}$ algorithm in
Lemma 6.4.
Lemma 6.4. Given a run length encoded pattern $P'$ and text $T'$, the algorithm $A_{\mathrm{RLE}}$ solves the Hamming
distance problem in $O(\mathrm{runs}(P'))$ space. The amortised time complexity of a NewRun or Diff operation is
$O(\mathrm{runs}(P') \log(\mathrm{runs}(P')))$ or $O(\log(\mathrm{runs}(P')))$ respectively. No preprocessing is needed.

Proof. The space complexity follows from Chen et al., who observe that the size of the priority queue is
$O(\mathrm{runs}(P'))$ at any time. The whole of $P'$ can be stored in $O(\mathrm{runs}(P'))$ space. Only the latest symbol of $T'$
is required.

Recall that the time complexities are amortised over all NewRun and Diff operations performed so
far. The number of points inserted into the priority queue is $O(\mathrm{runs}(P'))$ per NewRun performed. A cost
of $O(\mathrm{runs}(P') \log(\mathrm{runs}(P')))$ is charged to the NewRun which inserted them. This pays for processing
them during any subsequent NewRun or Diff operations. The amortised time complexity of the NewRun
operation is therefore $O(\mathrm{runs}(P') \log(\mathrm{runs}(P')))$ because priority queue operations take $O(\log(\mathrm{runs}(P')))$
time. Similarly, the amortised time complexity of the Diff operation is $O(\log(\mathrm{runs}(P')))$. □

The $k$-mismatch algorithm. We now give our full algorithm for the $k$-mismatch problem in the small
approximate period case. Recall that in this section we assume that $|T| = 2m$. The algorithm performs
three phases, Setup, Handover and Output, depending on the value of $i$ when $T[i]$ arrives. The symbol
$T[m - 1]$ is processed by all three phases (in ascending order) and is the only symbol processed by the
Handover phase.

Setup phase: ($i \le m - 1$). We maintain a modified $\ell$-run length encoding of the longest suffix $T_L$ of
the current text $T[0, i]$ such that $\mathrm{runs}_\ell(T_L) \le 5k$ (see Lemma 6.5). More formally, we maintain for each
$r \in [0, \ell - 1]$ a linked list of tuples $(j, T[j])$, where $j$ are the starting positions of runs in $T_L^s$ for $s =
i_1 + r \bmod \ell$. We also maintain the length of each list and the total length of all lists.

Handover phase: ($i = m - 1$). We compute the $\ell$-run length encoding of $T_L$ and then start $\ell^2$ instances
of $A_{\mathrm{RLE}}$. For each $(r, s) \in [0, \ell - 1]^2$, the instance denoted $A_{\mathrm{RLE}}(r, s)$ uses pattern $P^r$ and text $T_L^{s'}$, where
$s' + m - |T_L| \equiv s \bmod \ell$. A sequence of NewRun operations is performed immediately on $A_{\mathrm{RLE}}(r, s)$
to provide the whole of the run length encoding of $T_L^{s'}$ as text input. The NewRun operations are offset to
account for the start of $T_L^{s'}$ within $T^s$. Specifically, for each $T_L^{s'}[i'] \ne T_L^{s'}[i' - 1]$ we perform
NewRun$(i' + \lceil (m - s)/\ell \rceil - |T_L^{s'}|, T_L^{s'}[i'])$.

Output phase: ($i > m - 1$). We perform four steps:

1. First, we check whether $T[i]$ starts a new run in $T^s$ where $s = i \bmod \ell$. If so, for each $r \in [0, \ell - 1]$
we perform NewRun$(\lfloor i/\ell \rfloor, T[i])$ on instance $A_{\mathrm{RLE}}(r, s)$. Recall that every NewRun$(\lfloor i/\ell \rfloor, T[i])$
operation also triggers a Diff$(\lfloor i/\ell \rfloor)$ operation.

2. Second, for each $r \in [0, \ell - 1]$ we compute $\Delta_{r,s}[\lfloor i/\ell \rfloor]$ - the value of $\Delta[\lfloor i/\ell \rfloor]$ for instance $A_{\mathrm{RLE}}(r, s)$
where $s = i \bmod \ell$. To this end we determine the set of all $r \in [0, \ell - 1]$ such that $i^*_{r,s} = \lfloor i/\ell \rfloor$.
Here $i^*_{r,s}$ is the $i^*$ value outputted by the last Diff operation performed on $A_{\mathrm{RLE}}(r, s)$. For every such
$A_{\mathrm{RLE}}(r, s)$ we perform Diff$(\lfloor i/\ell \rfloor)$ to compute $\Delta_{r,s}[\lfloor i/\ell \rfloor]$ and then update $i^*_{r,s}$. For all other $(r, s)$,
we have that $\Delta_{r,s}[\lfloor i/\ell \rfloor] = \Delta_{r,s}[\lfloor i/\ell \rfloor - 1]$.

3. Third, we check whether the total number of runs processed by all $A_{\mathrm{RLE}}$ instances exceeds $5k$. If
so, all $A_{\mathrm{RLE}}$ instances are abandoned and we output "No" for this and every subsequent value of $i$ in
$[m - 1, 2m - 1]$.

4. Finally, we compute the latest Hamming distance, $\mathrm{Ham}(P, T)[i]$, from $\mathrm{Ham}(P, T)[i - \ell]$ and the
outputs of the $A_{\mathrm{RLE}}(r, s)$ using the equations from Lemma 6.6 and Lemma 6.7 as described below.
All steps of the algorithm are self-explanatory, except for the Setup phase and the fourth step of the
Output phase, which we describe in detail below. We start by giving a lemma that will allow us to compute
$T_L$ (the Setup phase).

Lemma 6.5. Given the modified $\ell$-run length encoding of $S = T[i_1, i_2]$, the modified $\ell$-run length encoding
of either $T[i_1 + 1, i_2]$ or $T[i_1, i_2 + 1]$ can be computed in $O(1)$ time.

Proof. To compute the encoding of $T[i_1 + 1, i_2]$, we go to the $(i_1 \bmod \ell)$-th list. The first two tuples in
this list define the length of the first run in the corresponding partition of $S$. If it equals one, we delete the first tuple and then
decrement the length of the list and the total length of the lists by one. Otherwise, we simply replace the
first tuple by $(i_1 + \ell, T[i_1 + \ell])$.

To compute the encoding of $T[i_1, i_2 + 1]$, we go to the $((i_2 + 1) \bmod \ell)$-th list. The last tuple in the list
defines whether $T[i_2 + 1]$ starts a new run in the corresponding partition of $S$. If it does, we add a new tuple $(i_2 + 1, T[i_2 + 1])$
to the list and increment the list's length and the total length by one. Otherwise, we do nothing. □
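The two constant-time updates of this lemma can be sketched with one double-ended queue per residue class. This is an illustration under stated assumptions rather than the authors' implementation: lists are indexed by the absolute text position modulo $\ell$, each tuple stores (start position, symbol), and the class and method names are hypothetical. The helper `from_scratch` recomputes the encoding naively and is only there for comparison.

```python
from collections import deque

class ModifiedRLE:
    """Modified ell-run length encoding of a sliding window T[lo, hi]."""

    def __init__(self, ell):
        self.ell = ell
        self.lists = [deque() for _ in range(ell)]  # tuples (start position, symbol)
        self.total = 0                              # total number of runs
        self.lo, self.hi = 0, -1                    # empty window initially

    def append(self, symbol):
        """Extend the window to T[lo, hi + 1], where T[hi + 1] = symbol."""
        self.hi += 1
        runs = self.lists[self.hi % self.ell]
        if not runs or runs[-1][1] != symbol:       # the last tuple decides
            runs.append((self.hi, symbol))
            self.total += 1

    def drop_first(self):
        """Shrink the (non-empty) window to T[lo + 1, hi]."""
        runs = self.lists[self.lo % self.ell]
        nxt = self.lo + self.ell
        # The first run has length one if the next run starts right after it,
        # or if the window holds no further position in this residue class.
        single = runs[1][0] == nxt if len(runs) >= 2 else nxt > self.hi
        if single:
            runs.popleft()
            self.total -= 1
        else:
            runs[0] = (nxt, runs[0][1])             # same symbol, later start
        self.lo += 1

def from_scratch(text, lo, hi, ell):
    """Naive recomputation of the same encoding, used only as a reference."""
    lists = [[] for _ in range(ell)]
    for pos in range(lo, hi + 1):
        if not lists[pos % ell] or lists[pos % ell][-1][1] != text[pos]:
            lists[pos % ell].append((pos, text[pos]))
    return lists

text, bound = "aabaabaacaabxab", 4   # bound plays the role of 5k in the Setup phase
enc = ModifiedRLE(3)
for symbol in text:
    enc.append(symbol)
    while enc.total > bound:         # keep the longest suffix with at most `bound` runs
        enc.drop_first()
print(enc.lo, enc.hi, enc.total)
```

The loop mirrors the Setup phase: after each arriving symbol the window is trimmed from the left until the run count is within the bound again.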

We now give two lemmas which combined will allow us to efficiently compute the final Hamming
distances (the fourth step of the Output phase). Note that the $A_{\mathrm{RLE}}$ instances collectively process the substring
$T^*$ as defined in Lemma 6.3. Let $T^* = T[i'_L, i'_R]$. (Recall that $T^*$ contains $T[i_L, i_R]$ but does not necessarily
equal it.) Remember that for any $i \notin [i'_L + m - 1, i'_R]$ we have that $\mathrm{Ham}(P, T)[i] > k$. For the first $\ell$
alignments in $[i'_L + m - 1, i'_R]$ we use Lemma 6.6 to calculate the output directly from the $A_{\mathrm{RLE}}$ outputs.

Lemma 6.6. For any $i \in [i'_L + m - 1, i'_R]$, we have that

$$\mathrm{Ham}(P, T)[i] = \sum_{r=0}^{\ell-1} \mathrm{Ham}(P^r, T^{R(r,i)})[Q(r, i)],$$

where $R(r, i) = (r + i - m + 1) \bmod \ell$ and $Q(r, i) = \lfloor (r + i - m + 1)/\ell \rfloor + |P^r| - 1$.

Proof. In the alignment of $P$ and $T[i - m + 1, i]$ we have that $P^r$ is aligned against $T[i - m + 1 + r] T[i -
m + 1 + r + \ell] \ldots T[i - m + 1 + r + \ell \cdot (|P^r| - 1)]$. The claim follows. □
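The decomposition in this lemma can be checked by brute force on small strings. The sketch below assumes that the text is partitioned from position 0 (so no offset for the start of $T^*$ is needed) and uses hypothetical function names; `ham_at(p, t, i)` stands for $\mathrm{Ham}(p, t)[i]$.

```python
def ham_at(p, t, i):
    """Ham(p, t)[i]: Hamming distance between p and t[i - |p| + 1 .. i]."""
    m = len(p)
    return sum(p[j] != t[i - m + 1 + j] for j in range(m))

def R(r, i, m, ell):
    """Index of the text partition that P^r is aligned against."""
    return (r + i - m + 1) % ell

def Q(r, i, m, ell):
    """Alignment (end position) of P^r inside the partition T^{R(r, i)}."""
    subpattern_length = len(range(r, m, ell))   # |P^r|
    return (r + i - m + 1) // ell + subpattern_length - 1

def ham_via_partitions(p, t, i, ell):
    """Right-hand side of the lemma: sum of Hamming distances over all partitions."""
    m = len(p)
    return sum(ham_at(p[r::ell], t[R(r, i, m, ell)::ell], Q(r, i, m, ell))
               for r in range(ell))

p = "abcabcabd"
t = "abcabcabdabcxbcabdab"
ell = 3
for i in range(len(p) - 1, len(t)):
    assert ham_via_partitions(p, t, i, ell) == ham_at(p, t, i)
print("decomposition agrees with the direct Hamming distance at every alignment")
```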

For the remaining alignments we use Lemma 6.7. We will compute $\mathrm{Ham}(P, T)[i]$ from $\mathrm{Ham}(P, T)[i -
\ell]$ and $\Delta^{\Sigma}[i]$, where $\Delta^{\Sigma}[i] = \sum_{r=0}^{\ell-1} \Delta_{r,R(r,i)}[Q(r, i)]$. The value of $\Delta^{\Sigma}[i]$ will in turn be computed from
$\Delta^{\Sigma}[i - \ell]$ by updating only the terms which have changed. We will argue below that these terms change very
rarely.

Lemma 6.7. $\mathrm{Ham}(P, T)[i] - \mathrm{Ham}(P, T)[i - \ell] = \sum_{r=0}^{\ell-1} \Delta_{r,R(r,i)}[Q(r, i)]$.

Proof. First consider Lemma 6.6 with $i - \ell$ substituted for $i$. We have that

$$\mathrm{Ham}(P, T)[i - \ell] = \sum_{r=0}^{\ell-1} \mathrm{Ham}(P^r, T^{R(r,i-\ell)})[Q(r, i - \ell)].$$

It follows from the definitions of $R$ and $Q$ that $R(r, i - \ell) = R(r, i)$ and $Q(r, i - \ell) = Q(r, i) - 1$. This
therefore simplifies to

$$\mathrm{Ham}(P, T)[i - \ell] = \sum_{r=0}^{\ell-1} \mathrm{Ham}(P^r, T^{R(r,i)})[Q(r, i) - 1].$$

We therefore have that $\mathrm{Ham}(P, T)[i] - \mathrm{Ham}(P, T)[i - \ell]$ equals

$$\sum_{r=0}^{\ell-1} \left( \mathrm{Ham}(P^r, T^{R(r,i)})[Q(r, i)] - \mathrm{Ham}(P^r, T^{R(r,i)})[Q(r, i) - 1] \right).$$

From the algorithm description it then follows that

$$\Delta_{r,R(r,i)}[Q(r, i)] = \mathrm{Ham}(P^r, T^{R(r,i)})[Q(r, i)] - \mathrm{Ham}(P^r, T^{R(r,i)})[Q(r, i) - 1].$$

The claim follows immediately via substitution. □


Space complexity. We now establish that the space complexity of the $k$-mismatch pattern matching
algorithm is $O(k^2)$ as stated in Lemma 3.3. The space required to store $P$ in the $\ell$-run length encoded form
as well as the suffix $T_L$ is $O(k)$ by definition. To compute the latest Hamming distance we store the most
recent $\ell$ Hamming distances as well as the last two $\Delta_{r,s}$ outputs from each Diff operation on each $A_{\mathrm{RLE}}(r, s)$ instance.
Only these Diff outputs are required because $Q(r, i) \in [\lfloor i/\ell \rfloor - 1, \lfloor i/\ell \rfloor]$, as we show in Lemma 6.8.

Lemma 6.8. $Q(r, i) \in [\lfloor i/\ell \rfloor - 1, \lfloor i/\ell \rfloor]$.


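Proof. Finally we demonstrate the observation that $Q(r, i) \in [\lfloor i/\ell \rfloor - 1, \lfloor i/\ell \rfloor]$. Substituting in the length
of $P^r$ we have that $Q(r, i)$ equals $\lfloor (r + i - m + 1)/\ell \rfloor + (\lfloor (m - r - 1)/\ell \rfloor + 1) - 1$. Further,

$$\left\lfloor \frac{i}{\ell} \right\rfloor - 1 \le \left\lfloor \frac{r + i - m + 1}{\ell} \right\rfloor + \left\lfloor \frac{m - r - 1}{\ell} \right\rfloor \le \left\lfloor \frac{i}{\ell} \right\rfloor.$$

□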
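The range claimed for $Q(r, i)$ is a statement about floors of two numbers that sum to $i$, so it can be checked exhaustively for small parameters. In the sketch below the function names are hypothetical, and it is assumed that $\ell \le m$ and $i \ge m - 1$, so that both numerators are non-negative.

```python
def subpattern_length(m, r, ell):
    """|P^r| for a pattern of length m partitioned with step ell."""
    return (m - r - 1) // ell + 1

def Q(r, i, m, ell):
    """Q(r, i) = floor((r + i - m + 1) / ell) + |P^r| - 1."""
    return (r + i - m + 1) // ell + subpattern_length(m, r, ell) - 1

def q_is_within_bounds(m, ell, max_i):
    """Check floor(i/ell) - 1 <= Q(r, i) <= floor(i/ell) for all valid r and i."""
    return all(i // ell - 1 <= Q(r, i, m, ell) <= i // ell
               for i in range(m - 1, max_i + 1)
               for r in range(ell))

print(q_is_within_bounds(m=21, ell=3, max_i=41))   # prints True
```

This is why keeping only the last two outputs per instance is enough: the alignment queried in an instance is always either the latest one or the one before it.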


As there are $\ell^2$ different $A_{\mathrm{RLE}}$ instances, this is $O(k^2)$ space. Finally we have to account for the working
space of the $A_{\mathrm{RLE}}$ instances. For any fixed $s \in [0, \ell - 1]$ the space used by all $A_{\mathrm{RLE}}(r, s)$ instances is
$\sum_r O(\mathrm{runs}(P^r)) = O(k)$, which is $O(k^2)$ space over all $s$. Therefore, the space complexity is $O(k^2)$ overall
as claimed.


Time complexity. Finally, we show that the time complexity of the $k$-mismatch pattern matching
algorithm is $O(nk^2 \log k/m + n)$. The time complexity of the Setup phase is $O(1)$ time per symbol, or $O(m)$
time overall, by Lemma 6.5. The Handover phase starts by computing the $\ell$-run length encoding of $T_L$
from the modified encoding maintained through the Setup phase, which can be done in $O(k)$ time. It then
performs the initialising NewRun operations on the $A_{\mathrm{RLE}}$ instances. The total time complexity for all
operations on the $A_{\mathrm{RLE}}$ instances will be accounted for below.

The Output phase is split into four steps. The first step is also dominated by the NewRun operations
on the $A_{\mathrm{RLE}}$ instances. The second step can be implemented so that the time complexity is dominated by
the Diff operations performed. In particular we need to avoid spending $O(\ell)$ time to check whether each
$r \in [0, \ell - 1]$ has $i^*_{r,s} = \lfloor i/\ell \rfloor$. For each $s$ we maintain a sorted linked list of the current values of each $i^*_{r,s}$.
We can then find all $i^*_{r,s} = \lfloor i/\ell \rfloor$ in time proportional to the number of such $i^*_{r,s}$, which in turn is equal to the
number of Diff operations performed. The third step takes $O(1)$ time per symbol via a simple counter, i.e.
$O(m)$ time in total.

Finally, we discuss the fourth step of the Output phase. To compute the Hamming distances for the
first $\ell$ alignments in $[i'_L + m - 1, i'_R]$, we apply Lemma 6.6. This takes $O(\ell)$ time per symbol, which is $O(\ell^2) = O(k^2)$ time
in total. For the remaining Hamming distances we apply Lemma 6.7. This would take $O(\ell)$ time per symbol as well if we
applied it directly. To avoid this, we compute the value of $\Delta^{\Sigma}[i]$ from the value of $\Delta^{\Sigma}[i - \ell]$ by determining
which terms have changed and updating them.

Fact 6.9. $\Delta^{\Sigma}[i] = \sum_{r=0}^{\ell-1} \Delta_{r,R(r,i-\ell)}[Q(r, i - \ell) + 1]$.

Proof. From the definitions of $R$ and $Q$ we have that $R(r, i) = R(r, i - \ell)$ and $Q(r, i) = Q(r, i - \ell) + 1$. □

On the other hand, $\Delta^{\Sigma}[i - \ell] = \sum_{r=0}^{\ell-1} \Delta_{r,R(r,i-\ell)}[Q(r, i - \ell)]$ by definition. By storing the most
recent $\Delta_{r,s}$ values for all $(r, s)$ (see Lemma 6.8), it is straightforward to determine which terms have changed
in time proportional to the number of terms that have changed. Furthermore, for $i_1 \not\equiv i_2 \bmod \ell$ and
$r \in [0, \ell - 1]$, we have that $R(r, i_1) \ne R(r, i_2)$. Consequently, for any $(r, s, j)$, there is at most one
value of $i$ such that $\Delta_{r,s}[j]$ appears as a term in the expression for $\Delta^{\Sigma}[i]$. Therefore the total time complexity
for step four is upper-bounded by the number of $(r, s, j)$ such that $\Delta_{r,s}[j] \ne \Delta_{r,s}[j - 1]$. This is in turn
upper-bounded by the total number of NewRun and Diff operations performed.

Remember that the total number of NewRun and Diff operations performed by all instances of $A_{\mathrm{RLE}}$
is at most $O(\mathrm{runs}_\ell(P) \cdot \mathrm{runs}_\ell(T^*)) = O(k^2)$. Therefore, the total time complexity is $O(m + k^2)$ excluding
the time taken to perform the NewRun and Diff operations. It remains to give an upper bound on the
total number of these operations for each $A_{\mathrm{RLE}}$. For a given $(r, s)$, the number of NewRun operations on
$A_{\mathrm{RLE}}(r, s)$ is $O(\mathrm{runs}(T^s))$.

The total time spent performing NewRun and Diff operations on $A_{\mathrm{RLE}}(r, s)$ is therefore $O(\mathrm{runs}(P^r) \cdot
\log(\mathrm{runs}(P^r)) \cdot \mathrm{runs}(T^s))$. Summing over all $A_{\mathrm{RLE}}$ instances, and simplifying, we have that

$$\sum_{r,s} O(\mathrm{runs}(P^r) \cdot \mathrm{runs}(T^s) \cdot \log k) = O\left( \sum_r \mathrm{runs}(P^r) \cdot \sum_s \mathrm{runs}(T^s) \cdot \log k \right) = O(k^2 \log k).$$

Therefore the total time complexity of the entire algorithm is $O(m + k^2 \log k)$. It is important for the
deamortised algorithm we give in Theorem 1.4 (which uses this algorithm as a black box) that if $m > 2k^2$
then for processing any $k^2$ consecutive text symbols we spend only $O(k^2 \log k)$ time, as the term $m$ in the
time complexity comes from spending $O(1)$ time per symbol in the worst case.